
Improve time-to-first-kernel #747

Draft: maleadt wants to merge 2 commits into main from tb/ttfx

Conversation

@maleadt (Member) commented Feb 25, 2026

... by adding a proper precompilation workload and removing some overzealous specialization.
This still needs to be properly validated to ensure that the @nospecialize on crucial functions like mtlfunction doesn't regress launch overhead.
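For context, a precompilation workload along these lines is commonly added with PrecompileTools.jl. The sketch below is hypothetical and may differ from what this PR actually does; `dummy` is an illustrative placeholder, and device kernels cannot be launched at precompile time, so only the host-side launch machinery is exercised:

```julia
# Hypothetical sketch of a precompilation workload (not this PR's actual code).
# Placed inside the Metal module definition so that the compiled code is
# cached into the package image.
using PrecompileTools

@setup_workload begin
    dummy() = nothing
    @compile_workload begin
        # Compile the host-side compiler entry point without a live GPU;
        # `mtlfunction` is the package function discussed in this PR.
        precompile(mtlfunction, (typeof(dummy), Type{Tuple{}}))
    end
end
```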

Before:

❯ hyperfine --warmup=1 'julia --project -e "using Metal; @metal identity(nothing)"'
Benchmark 1: julia --project -e "using Metal; @metal identity(nothing)"
  Time (mean ± σ):      2.695 s ±  0.135 s    [User: 3.360 s, System: 0.401 s]
  Range (min … max):    2.588 s …  3.063 s    10 runs

❯ hyperfine --warmup=1 'julia --project examples/vadd.jl'
Benchmark 1: julia --project examples/vadd.jl
  Time (mean ± σ):      3.827 s ±  0.237 s    [User: 4.471 s, System: 0.380 s]
  Range (min … max):    3.573 s …  4.260 s    10 runs

After:

❯ hyperfine --warmup=1 'julia --project -e "using Metal; @metal identity(nothing)"'
Benchmark 1: julia --project -e "using Metal; @metal identity(nothing)"
  Time (mean ± σ):      1.555 s ±  0.026 s    [User: 2.232 s, System: 0.405 s]
  Range (min … max):    1.516 s …  1.599 s    10 runs

❯ hyperfine --warmup=1 'julia --project examples/vadd.jl'
Benchmark 1: julia --project examples/vadd.jl
  Time (mean ± σ):      2.065 s ±  0.056 s    [User: 2.741 s, System: 0.293 s]
  Range (min … max):    2.007 s …  2.168 s    10 runs

@christiangnrd (Member) commented:
@maleadt Can I rebase?

@maleadt (Member, Author) commented Feb 28, 2026

Of course! I didn't have time to investigate the failure though.

@christiangnrd (Member) commented Feb 28, 2026

> I didn't have time to investigate the failure though.

Those were fixed as part of #740. Any failures now should be related to this PR.

@github-actions Bot (Contributor) left a comment:

Metal Benchmarks

Benchmark suite Current: c2099f9 Previous: 1d2f000 Ratio
latency/precompile 30595055750 ns 25549419083 ns 1.20
latency/ttfp 1719044667 ns 2346831687.5 ns 0.73
latency/import 1444265000 ns 1427666042 ns 1.01
integration/metaldevrt 881709 ns 877750 ns 1.00
integration/byval/slices=1 1625167 ns 1568625 ns 1.04
integration/byval/slices=3 20737458 ns 8402792 ns 2.47
integration/byval/reference 1621459 ns 1559958 ns 1.04
integration/byval/slices=2 2743708.5 ns 2629875 ns 1.04
kernel/indexing 514083 ns 627417 ns 0.82
kernel/indexing_checked 500375 ns 608750 ns 0.82
kernel/launch 14417 ns 12667 ns 1.14
kernel/rand 536062.5 ns 576167 ns 0.93
array/construct 7000 ns 6500 ns 1.08
array/broadcast 527917 ns 606708 ns 0.87
array/random/randn/Float32 1059583.5 ns 1011104 ns 1.05
array/random/randn!/Float32 737166 ns 753875 ns 0.98
array/random/rand!/Int64 539417 ns 548708 ns 0.98
array/random/rand!/Float32 542958 ns 586208.5 ns 0.93
array/random/rand/Int64 935917 ns 789709 ns 1.19
array/random/rand/Float32 797541.5 ns 645000 ns 1.24
array/accumulate/Int64/1d 1290917 ns 1260667 ns 1.02
array/accumulate/Int64/dims=1 1914958 ns 1859104.5 ns 1.03
array/accumulate/Int64/dims=2 2317125 ns 2179083 ns 1.06
array/accumulate/Int64/dims=1L 12186541 ns 11673271 ns 1.04
array/accumulate/Int64/dims=2L 9856875 ns 9628146 ns 1.02
array/accumulate/Float32/1d 1070229.5 ns 1121395.5 ns 0.95
array/accumulate/Float32/dims=1 1642750 ns 1571667 ns 1.05
array/accumulate/Float32/dims=2 2074083 ns 1889459 ns 1.10
array/accumulate/Float32/dims=1L 10521791.5 ns 9834209 ns 1.07
array/accumulate/Float32/dims=2L 7366250 ns 7249666.5 ns 1.02
array/reductions/reduce/Int64/1d 1300812.5 ns 1386875 ns 0.94
array/reductions/reduce/Int64/dims=1 1142750 ns 1117250 ns 1.02
array/reductions/reduce/Int64/dims=2 1170125 ns 1152958 ns 1.01
array/reductions/reduce/Int64/dims=1L 2037687.5 ns 2013209 ns 1.01
array/reductions/reduce/Int64/dims=2L 4051479.5 ns 4244083 ns 0.95
array/reductions/reduce/Float32/1d 790042 ns 988750 ns 0.80
array/reductions/reduce/Float32/dims=1 805125 ns 843520.5 ns 0.95
array/reductions/reduce/Float32/dims=2 857833 ns 857917 ns 1.00
array/reductions/reduce/Float32/dims=1L 1361958.5 ns 1326625 ns 1.03
array/reductions/reduce/Float32/dims=2L 1839750 ns 1810667 ns 1.02
array/reductions/mapreduce/Int64/1d 1327916.5 ns 1356437.5 ns 0.98
array/reductions/mapreduce/Int64/dims=1 1139916 ns 1102166.5 ns 1.03
array/reductions/mapreduce/Int64/dims=2 1191916 ns 1149750 ns 1.04
array/reductions/mapreduce/Int64/dims=1L 1992813 ns 1988375 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 3668520.5 ns 3626916 ns 1.01
array/reductions/mapreduce/Float32/1d 769083 ns 1055917 ns 0.73
array/reductions/mapreduce/Float32/dims=1 828458.5 ns 847396 ns 0.98
array/reductions/mapreduce/Float32/dims=2 859729 ns 860979.5 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 1377875 ns 1333042 ns 1.03
array/reductions/mapreduce/Float32/dims=2L 1863479.5 ns 1898125 ns 0.98
array/private/copyto!/gpu_to_gpu 575041.5 ns 633020.5 ns 0.91
array/private/copyto!/cpu_to_gpu 716104.5 ns 804354.5 ns 0.89
array/private/copyto!/gpu_to_cpu 733000 ns 816000 ns 0.90
array/private/iteration/findall/int 1618125 ns 1581312.5 ns 1.02
array/private/iteration/findall/bool 1467937.5 ns 1404916.5 ns 1.04
array/private/iteration/findfirst/int 2130875 ns 2075167 ns 1.03
array/private/iteration/findfirst/bool 2091646 ns 2048750 ns 1.02
array/private/iteration/scalar 3148250 ns 4526479 ns 0.70
array/private/iteration/logical 2714583 ns 2693625 ns 1.01
array/private/iteration/findmin/1d 2641812.5 ns 2518041 ns 1.05
array/private/iteration/findmin/2d 1864125 ns 1820229.5 ns 1.02
array/private/copy 857958.5 ns 568854 ns 1.51
array/shared/copyto!/gpu_to_gpu 85645.5 ns 84291 ns 1.02
array/shared/copyto!/cpu_to_gpu 84333 ns 82875 ns 1.02
array/shared/copyto!/gpu_to_cpu 83833 ns 83000 ns 1.01
array/shared/iteration/findall/int 1615958 ns 1585854.5 ns 1.02
array/shared/iteration/findall/bool 1502875 ns 1421875 ns 1.06
array/shared/iteration/findfirst/int 1734583 ns 1654709 ns 1.05
array/shared/iteration/findfirst/bool 1691375 ns 1643542 ns 1.03
array/shared/iteration/scalar 208917 ns 210375 ns 0.99
array/shared/iteration/logical 2267250 ns 2297959 ns 0.99
array/shared/iteration/findmin/1d 2258875 ns 2134229 ns 1.06
array/shared/iteration/findmin/2d 1880958 ns 1806042 ns 1.04
array/shared/copy 222000 ns 241812 ns 0.92
array/permutedims/4d 2538479 ns 2395583 ns 1.06
array/permutedims/2d 1240667 ns 1158833 ns 1.07
array/permutedims/3d 1854333 ns 1686541 ns 1.10
metal/synchronization/stream 19875 ns 19583 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

@christiangnrd (Member) left a comment:
I know this is still WIP, but hopefully this saves you some troubleshooting time.

device_synchronize seems to be broken due to the task-local storage call.

Edit: From a fresh session:

julia> using Metal; length(Metal.global_queues)
1 # Should be 0

Edit 2: Adding `empty!(Metal.global_queues)` to `__init__()` seems to prevent the segfault, but that can't be the proper solution, right?
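A minimal standalone illustration of the suspected mechanism (hypothetical; Metal.jl's actual queue cache differs): if a global cache is populated while the precompile workload runs, its entries are serialized into the package image and refer to stale GPU state on reload, which is why clearing it in `__init__` masks the problem.

```julia
# Hypothetical sketch, not Metal.jl's actual code.
module QueueCacheDemo

# A module-level cache that could accidentally capture entries
# during precompilation and have them baked into the package image.
const global_queues = Dict{Int,Symbol}()

get_queue!(dev::Int) = get!(() -> Symbol("queue_", dev), global_queues, dev)

function __init__()
    # The workaround discussed above: drop anything captured at
    # precompile time, forcing queues to be recreated against live state.
    empty!(global_queues)
end

end # module
```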

Comment thread src/compiler/execution.jl
the function changes, or when different types or keyword arguments are provided.
"""
-function mtlfunction(f::F, tt::TT=Tuple{}; name=nothing, kwargs...) where {F,TT}
+function mtlfunction(@nospecialize(f), @nospecialize(tt)=Tuple{}; name=nothing, kwargs...)
Member:
This despecialization breaks the inference test below. But that seems to be on purpose, so maybe remove the test?

@testset "inference" begin
    foo() = @metal dummy()
    @inferred foo()
    # with arguments, we call mtlconvert
    kernel(a) = return
    bar(a) = @metal kernel(a)
    @inferred bar(MtlArray([1]))
end

This only shows up when the device_synchronize test is commented out.
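For what it's worth, this failure mode is expected: inference widens a `@nospecialize`'d argument toward its declared type, so wrappers around such a function typically can no longer prove a concrete return type. A standalone illustration (hypothetical, not the package's code; outcomes hedged because constant propagation can sometimes recover precision):

```julia
# Illustrates how @nospecialize can interact with @inferred.
using Test

add_spec(x) = x + 1
add_nospec(@nospecialize(x)) = x + 1

call_spec() = add_spec(1)      # return type inferred as Int
call_nospec() = add_nospec(1)  # return type may widen to Any

@inferred call_spec()          # passes
# @inferred call_nospec()      # may throw: "return type Int does not match
#                              #  inferred return type Any"
```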

Comment thread src/compiler/execution.jl
finally
close(cce)
end
@autoreleasepool begin
Member:
Does this @autoreleasepool do anything? The parent function (`@autoreleasepool function (kernel::HostKernel)(args...)`) is already annotated with one.
